Week 3: Retrieval Augmented Generation (RAG) - Part 1

Applied Generative AI for AI Developers

Amit Arora

What is RAG?

RAG = Retrieval Augmented Generation
A generative AI approach where the model combines external knowledge retrieval with text generation to provide more accurate and contextually rich responses.

Why RAG?

  • Augments LLM responses with relevant context: Instead of relying solely on the LLM’s training data, RAG retrieves and incorporates specific, up-to-date information into responses.

  • Helps ground responses in factual information: By providing relevant context from trusted sources, RAG helps ensure responses are based on actual source material rather than on unsupported model-generated content.

  • Reduces hallucinations: With access to specific, retrieved information, the model is less likely to generate incorrect or fabricated responses.

  • Enables use of private/proprietary data: Organizations can leverage their internal documents, knowledge bases, and proprietary information that wasn’t part of the LLM’s training data.

  • Provides source attribution: RAG systems can track where information comes from, making responses more transparent and verifiable.

Simple RAG Architecture

[Figure: Simple RAG architecture]

Key Components:

  • Document Processing: Converts raw documents into chunks and creates embeddings for efficient retrieval.
  • Vector Storage: Stores document embeddings and enables similarity search.
  • Query Processing: Converts user questions into embeddings and finds relevant documents.
  • Response Generation: Combines retrieved context with LLM capabilities to generate accurate answers.

Building a Basic RAG App

  1. Prepare documents: Clean and preprocess your source documents, removing irrelevant content and standardizing format.

  2. Create embeddings: Convert text chunks into numerical vectors using embedding models such as BGE-large-en-v1.5 (available on Hugging Face), Amazon Titan embeddings, OpenAI’s text-embedding-ada-002, or Cohere’s embed-multilingual.

  3. Store in vector database: Upload embeddings to a vector store like Pinecone, Weaviate, or FAISS for efficient similarity search.

  4. Process user query: Convert the user’s question into an embedding using the same embedding model.

  5. Retrieve relevant context: Perform similarity search to find the most relevant document chunks.

  6. Generate response: Combine retrieved context with an LLM prompt to generate an accurate, contextual response.
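The six steps above can be sketched end to end in a few lines. This is a toy illustration, not a production implementation: the hash-based `embed` function stands in for a real embedding model, a plain in-memory list stands in for the vector database, and the final LLM call is omitted.

```python
import hashlib
import math
import re
from collections import Counter

def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for an embedding model: hash tokens into a fixed-size
    bag-of-words vector, then L2-normalize. A real app would call a model
    such as BGE, Titan, or text-embedding-ada-002 here."""
    vec = [0.0] * dim
    for token, count in Counter(re.findall(r"[a-z]+", text.lower())).items():
        slot = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[slot] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already normalized

# Steps 1-3: prepare documents, embed them, and store them (in memory here).
docs = [
    "RAG retrieves external context before generation.",
    "Embeddings map text to dense numerical vectors.",
    "Vector databases support fast similarity search.",
]
index = [(doc, embed(doc)) for doc in docs]

# Steps 4-5: embed the user query with the SAME model and retrieve the
# best-matching chunk by cosine similarity.
query = "how does similarity search work in a vector database"
q_vec = embed(query)
context = max(index, key=lambda item: cosine(q_vec, item[1]))[0]

# Step 6: assemble the augmented prompt (the LLM call itself is omitted).
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(context)
```

Note that the query must be embedded with the same model as the documents; mixing embedding models puts query and document vectors in incompatible spaces.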

Chunking Strategies

Reference: Chunking techniques with LangChain and LlamaIndex

  • Document segmentation approaches: Choose between fixed-size chunks, semantic chunking, or paragraph-based splitting depending on your content structure.

  • Chunk size considerations: Balance between too large (dilutes relevance) and too small (loses context); a range of roughly 256-1024 tokens typically works well.

  • Overlap between chunks: Include some overlap (10-20%) between consecutive chunks to maintain context across boundaries.

  • Maintaining context: Preserve important metadata and hierarchical information when splitting documents.

  • Structured vs unstructured data: Adapt chunking strategy based on whether you’re dealing with free text, tables, or structured documents.
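As a minimal sketch of fixed-size chunking with overlap, the function below counts words rather than model tokens (production pipelines usually count tokenizer tokens, e.g. via LangChain's text splitters):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size chunking with overlap. Sizes are counted in words here
    for simplicity; real pipelines usually count model tokens instead."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    # Each window advances by (chunk_size - overlap) words, so consecutive
    # chunks repeat `overlap` words of shared context across the boundary.
    for start in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 10 words, chunks of 4 with an overlap of 2: windows advance by 2 words.
demo = "one two three four five six seven eight nine ten"
for chunk in chunk_text(demo, chunk_size=4, overlap=2):
    print(chunk)
```

With chunk_size=4 and overlap=2, the demo yields "one two three four", "three four five six", "five six seven eight", "seven eight nine ten": each boundary carries two words of context into the next chunk.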

Embeddings Deep Dive

Key Considerations:

  • Model selection criteria: Consider factors like accuracy, speed, cost, and dimension size when choosing an embedding model.

  • Dimensionality impact: Higher dimensions can capture more information but increase storage costs and retrieval time.

  • Multi-lingual support: Choose models like Cohere multilingual or Amazon Titan if your application needs to handle multiple languages.

  • Domain-specific needs: Consider fine-tuning embedding models for specialized domains like medical or legal text, for example with the Sentence Transformers library.
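Whatever embedding model is chosen, similarity between two embeddings is most often measured with cosine similarity; note that both vectors must come from the same model and therefore have the same dimensionality. A minimal implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means the
    same direction (very similar), 0.0 means orthogonal (unrelated),
    -1.0 means opposite. Assumes both vectors have the same dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Because cosine similarity only depends on direction, many systems L2-normalize embeddings once at indexing time, after which a plain dot product gives the same ranking more cheaply.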

Vector Databases

Features to Consider:

  • Scalability: Ability to handle millions or billions of vectors efficiently.

  • Query performance: Fast similarity search with support for approximate nearest neighbors (ANN) algorithms.

  • Similarity search algorithms: Support for different distance metrics (cosine, euclidean) and indexing methods.

  • Metadata filtering: Ability to combine vector similarity search with metadata filters.

  • Cost considerations: Balance between hosting costs, query costs, and storage requirements.
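To make these features concrete, here is a deliberately naive in-memory "vector store" supporting similarity search plus metadata filtering. The linear scan is exactly what real vector databases avoid at scale by using ANN indexes such as HNSW or IVF; the store layout and field names here are illustrative assumptions, not any particular product's API.

```python
import math

def knn_search(store, query_vec, top_k=2, metadata_filter=None):
    """Brute-force nearest-neighbor search over a list of
    (vector, metadata) pairs, with optional exact-match metadata filtering.
    Real vector databases replace this O(n) scan with ANN indexes."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Metadata filtering: keep only entries matching every filter key.
    candidates = [
        (vec, meta)
        for vec, meta in store
        if metadata_filter is None
        or all(meta.get(key) == val for key, val in metadata_filter.items())
    ]
    # Rank the survivors by distance to the query vector.
    return sorted(candidates, key=lambda item: euclidean(item[0], query_vec))[:top_k]

store = [
    ([0.1, 0.9], {"doc": "a", "year": 2023}),
    ([0.8, 0.2], {"doc": "b", "year": 2024}),
    ([0.2, 0.8], {"doc": "c", "year": 2024}),
]

# Combine vector similarity with a metadata filter: nearest 2024 document.
hits = knn_search(store, [0.0, 1.0], top_k=1, metadata_filter={"year": 2024})
print(hits[0][1]["doc"])  # -> c
```

Note how the filter is applied before ranking; whether a database filters before or after the ANN search (pre- vs post-filtering) materially affects both recall and latency.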

Examples of Vector Databases

  • Pinecone:
    • Scalable and high-performance vector database.
    • Designed for real-time search with high availability and easy integration.
    • Offers fully managed services with automatic scaling and monitoring.
  • Weaviate:
    • Open-source vector search engine with support for hybrid search (text + vector).
    • Schema-free or schema-driven, flexible for various data types.
    • Built-in ML model hosting and extensible through modules like transformers.
  • Milvus:
    • Cloud-native vector database optimized for high-throughput and low-latency vector retrieval.
    • Open-source with strong community support and enterprise-grade features.
    • Supports massive-scale data management for AI and analytics applications.

Examples of Vector Databases (contd.)

  • Qdrant:
    • Feature-rich, open-source vector database with support for filtering and hybrid search.
    • Integrates easily with other tools like LangChain and Python.
    • Designed for both small-scale and production-grade deployments.
  • Vespa:
    • A scalable engine supporting full-text, vector, and structured data search.
    • Highly customizable ranking functions for advanced retrieval tasks.
    • Enterprise-grade features, including sharding and high availability.
  • Redis (with Redis Stack / the RediSearch module):
    • Extends the Redis key-value store to support vector similarity search.
    • Real-time capabilities with minimal latency and optional AI integration.
    • Excellent choice for lightweight applications or adding vector search to existing Redis setups.

Examples of Vector Databases (contd.)

  • FAISS (by Meta AI):
    • A library rather than a traditional database, optimized for similarity search and clustering.
    • Ideal for applications requiring high-speed vector operations on dense datasets.
    • Limited to in-memory computation but extremely efficient.
  • OpenSearch:
    • Open-source search and analytics platform with vector search support using k-NN.
    • Enables hybrid search across text, vector embeddings, and metadata.
    • Forked from Elasticsearch, with strong integration into big data ecosystems.
  • Chroma:
    • Lightweight and developer-friendly embedding store.
    • Designed for rapid prototyping and easy integration with LLM applications.
    • Optimized for smaller-scale use cases but growing in capabilities.
  • PostgreSQL (with pgvector):
    • Extends PostgreSQL to store and retrieve high-dimensional vector embeddings.
    • Leverages PostgreSQL’s powerful query capabilities, indexes, and extensions.
    • Suitable for teams already using PostgreSQL for traditional relational data.

How to Choose a Vector Database

  • Data Has Gravity:
    • Where is your data stored now? Choose a database that minimizes data migration and latency.
    • Some databases, like PostgreSQL with pgvector, integrate well with existing relational systems.
  • Retrieval Latency:
    • Different databases offer varied performance levels for real-time queries.
    • Evaluate latency requirements, especially for user-facing applications.
  • Vector Dimensions Supported:
    • Ensure the database supports the dimensions of your vector embeddings.
    • Some systems may perform poorly or be incompatible with high-dimensional vectors.

How to Choose a Vector Database (contd.)

  • Ingestion Pipelines and Connectors:
    • Seamless integration with your existing pipelines can save significant development time.
    • Look for native connectors or APIs that fit into your workflows (e.g., LangChain, Python, or cloud platforms).
  • Search Algorithms:
    • Does the database support the algorithms you need (e.g., k-NN, HNSW, or hybrid search)?
    • Certain use cases may demand customizable ranking or filtering capabilities.

Query Processing

Advanced Techniques:

  • Query rewriting: Reformulate user queries to improve retrieval performance, often using an LLM to generate better search terms.

  • Query expansion: Generate multiple variations of the query to increase the chance of finding relevant information.

  • Entity extraction: Identify and use key entities from the query for more focused retrieval.

  • Hybrid search: Combine semantic (embedding-based) and lexical (keyword-based) search for better results.

  • Query decomposition: Break complex queries into simpler sub-queries that can be processed independently.
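A common way to implement the hybrid-search step above is Reciprocal Rank Fusion (RRF), which merges a keyword ranking and a vector ranking using only rank positions, sidestepping the problem that lexical and semantic scores are on incompatible scales. A minimal sketch (the document IDs are made up for illustration):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge several ranked result lists (e.g. one
    from keyword/BM25 search and one from embedding similarity). Each
    document scores 1/(k + rank) per list it appears in; k=60 is a
    commonly used default."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["d1", "d3", "d2"]  # lexical (keyword-based) ranking
vector_results = ["d3", "d4", "d1"]   # semantic (embedding-based) ranking
print(rrf_fuse([keyword_results, vector_results]))
```

Here "d3" wins because it ranks highly in both lists, while "d1" leads only the keyword list; documents found by just one retriever still survive with lower fused scores.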

Evaluation & Benchmarking

Metrics:

  • Mean Reciprocal Rank (MRR): Measures how early in the results list the first relevant document appears.

  • Mean Average Precision (MAP): Evaluates precision at different recall levels, providing a single score for overall retrieval quality.

  • Normalized Discounted Cumulative Gain (NDCG): Measures the quality of ranking by considering both relevance and position.

  • Recall: Assesses whether all relevant documents are retrieved from the collection.
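Two of these metrics, MRR and recall@k, are simple enough to compute directly; the sketch below uses toy judgments (in practice the relevance labels come from an evaluation dataset):

```python
def mrr(results_per_query):
    """Mean Reciprocal Rank: the average of 1/rank of the first relevant
    hit. Each entry is one query's ranked result list as booleans
    (True = the document at that position is relevant)."""
    total = 0.0
    for ranked in results_per_query:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts for MRR
    return total / len(results_per_query)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# Query 1 finds a relevant doc at rank 1, query 2 only at rank 3:
print(mrr([[True, False], [False, False, True]]))          # (1 + 1/3) / 2
print(recall_at_k(["d1", "d2", "d3"], ["d2", "d9"], k=3))  # 1 of 2 relevant
```

MRR rewards putting the first relevant document early, while recall@k asks whether the retriever surfaces all the evidence the generator will need; a good RAG evaluation typically tracks both.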

RAG Pipelines

Popular Frameworks:

  • LlamaIndex: Provides high-level abstractions for building RAG applications with features like structured data handling and custom retrievers.

  • Haystack: Offers modular components for building production-ready search and RAG systems with strong evaluation capabilities.

  • LangChain: Enables building complex chains of operations for RAG applications with extensive integration options.

  • Amazon Bedrock Prompt Flows: Managed service for building and deploying RAG applications with integration to AWS services.

Putting it all together

[Figure: Full RAG architecture]

References

  • Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (https://arxiv.org/pdf/2005.11401)
  • Paper: Cache-Augmented Generation (CAG) as an alternative to RAG (https://arxiv.org/pdf/2412.15605v1)
  • Paper: Corrective Retrieval Augmented Generation (https://arxiv.org/pdf/2401.15884)
  • Paper: KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation (https://arxiv.org/html/2409.13731v3)
  • Traditional RAG vs. Graph RAG (https://x.com/akshay_pachaar/status/1875520939536142656)
  • Traditional RAG vs. HyDE (https://www.dailydoseofds.com/p/traditional-rag-vs-hyde/)
  • Build a corrective RAG application (https://www.theunwindai.com/p/build-a-corrective-rag-agent)
  • Challenges and components of production-grade RAG AI systems (https://x.com/Aurimas_Gr/status/1879148810158452777)
  • Building a multi-tenant RAG app with easy integrations (https://x.com/akshay_pachaar/status/1879154648327811134)
  • MemoRAG enhances RAG with long-term memory capabilities (https://x.com/akshay_pachaar/status/1878916141122462139)

References (contd)